@author : Kevin KURTZ
@email : kevin.kurtz@efrei.net

Machine Learning¶

Project: Diabetes Classification Using Machine Learning¶

This project aims to classify whether a patient has diabetes based on several health metrics, such as glucose level, blood pressure, BMI, and more. We use the Pima Indians Diabetes Dataset for this analysis, which includes 768 patient records. Through machine learning models, we try to predict the Outcome (1 for diabetic, 0 for non-diabetic) based on the provided health features.

Dependencies:¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Dataset :¶

Let's start by loading the Pima Indians Diabetes Dataset. Here, I load it directly from my own GitHub repository.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/ktzkvin/Diabetes-Prediction-ML/refs/heads/main/data-diabetes.csv")

# ---- Another way is to load the dataset from a local file :
# df = pd.read_csv("data-diabetes.csv")

1. Data Preprocessing¶

1.1 Analyze the dataset to understand the distribution of classes:¶

In [3]:
# Check number of rows and columns
print("Number of rows: ", df.shape[0])
print("Number of columns: ", df.shape[1])

# Display basic statistics
df.describe()
Number of rows:  768
Number of columns:  9
Out[3]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [8]:
# Show general info about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

As we can see, the dataset has 768 rows and 9 columns, and there are no null values. However, some zeros will turn out to stand in for missing measurements, which we address in section 1.2.

In [11]:
%matplotlib inline

class_counts = df['Outcome'].value_counts()

ax = sns.countplot(x='Outcome', data=df)
plt.title('Distribution of Diabetic (1) vs Non-Diabetic (0)')
plt.xlabel('Class (0: Non-Diabetic, 1: Diabetic)')
plt.ylabel('Count')

# Adding the percentage for each bar
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height + 3, f'{int(height)}', ha="center")

colors = ['palegreen', 'salmon']
plt.bar(class_counts.index, class_counts, color=colors)

plt.show()

As we can see, this dataset contains 500 non-diabetic patients and 268 diabetic patients, so it is somewhat imbalanced (about 65% non-diabetic vs. 35% diabetic).
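The percentages quoted above follow directly from the class counts; a minimal sketch, using the counts observed above hard-coded in place of `df['Outcome'].value_counts()`:

```python
import pandas as pd

# Class counts observed above: 500 non-diabetic (0), 268 diabetic (1)
class_counts = pd.Series({0: 500, 1: 268}, name='Outcome')

# Share of each class as a percentage, rounded to one decimal
percentages = (class_counts / class_counts.sum() * 100).round(1)
print(percentages)  # 0 -> 65.1, 1 -> 34.9
```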

Dataset Profiling with ydata-profiling¶

To get a quick overview of the dataset, we use ydata-profiling, which generates an interactive report with detailed statistics for each feature. Below is the profiling report for the Pima Indians Diabetes Dataset:

In [12]:
from ydata_profiling import ProfileReport 
 
profile = ProfileReport(df, title="Pima Indians Diabetes Dataset Profiling Report") 
profile.to_notebook_iframe()

1.2 Handle missing values (replacing 0 by median values):¶

Why use the median to replace 0 values?¶

In medical datasets like this one, certain variables such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI should not have a value of 0. These zeros likely represent missing or incorrect data rather than actual measurements. For instance, it's impossible for a living person to have a blood pressure of 0.
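Before imputing, it helps to quantify how widespread these placeholder zeros are. A minimal sketch, shown here with a small toy frame standing in for the real `df` (the real notebook would run the same expression on `df` directly):

```python
import pandas as pd

# Toy stand-in for the real dataset, with a few placeholder zeros
toy = pd.DataFrame({
    'Glucose':       [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 0],
    'Insulin':       [0, 0, 0, 94],
})

# Count zeros per column among the physiologically impossible ones
zero_counts = (toy == 0).sum()
print(zero_counts)  # Glucose 1, BloodPressure 2, Insulin 3
```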

To handle this, we choose to replace these "0" values with the median. The median is more robust to outliers than the mean, which makes it a better choice in medical contexts where extreme values could distort the average. Using the median helps maintain the accuracy of the dataset without introducing unnecessary bias from outliers. Also, deleting these rows is not a good option because it would result in a significant loss of data, since there are only 768 rows in the dataset.

Additionally, we calculate separate medians for Outcome 0 (non-diabetic) and Outcome 1 (diabetic). This ensures that the data imputation is more accurate and reflects the underlying distribution of each class. This approach is more effective than using a single median for the entire dataset.

In [9]:
# Splitting the dataset based on the 'Outcome' (0 = No Diabetes, 1 = Diabetes)
df_outcome_0 = df[df['Outcome'] == 0]
df_outcome_1 = df[df['Outcome'] == 1]

# Calculating the medians for each Outcome group
median_outcome_0 = df_outcome_0[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].median()
median_outcome_1 = df_outcome_1[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].median()

# Function to replace "0" values with the median based on Outcome
def impute_with_median(row, median_0, median_1):
    if row['Outcome'] == 0:
        for col in median_0.index:
            if row[col] == 0:
                row[col] = median_0[col]
    else:
        for col in median_1.index:
            if row[col] == 0:
                row[col] = median_1[col]
    return row

# Apply the function to the dataset
df = df.apply(impute_with_median, axis=1, args=(median_outcome_0, median_outcome_1))

# Check the dataset again after imputation
df.describe()
Out[9]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.00000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 121.677083 72.378906 27.43099 91.783854 32.433919 0.471876 33.240885 0.348958
std 3.369578 30.464161 12.104431 9.32146 108.121136 6.880664 0.331329 11.760232 0.476951
min 0.000000 44.000000 24.000000 7.00000 0.000000 18.200000 0.078000 21.000000 0.000000
25% 1.000000 99.750000 64.000000 21.00000 39.000000 27.500000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 27.00000 39.000000 32.050000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.00000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.00000 846.000000 67.100000 2.420000 81.000000 1.000000

1.3 Encode categorical features appropriately (verifying all columns are numerical):¶

In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
 8   Outcome                   768 non-null    float64
dtypes: float64(9)
memory usage: 54.1 KB

All features are already numerical, so no encoding is necessary.
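Had the dataset contained a categorical column (it does not), one-hot encoding with `pd.get_dummies` would be the usual fix. A hypothetical sketch with an invented `Smoker` column, purely for illustration:

```python
import pandas as pd

# Hypothetical categorical feature -- NOT part of the Pima dataset
demo = pd.DataFrame({'Smoker': ['yes', 'no', 'yes'], 'Age': [25, 40, 33]})

# One-hot encode the categorical column so every feature becomes numeric
encoded = pd.get_dummies(demo, columns=['Smoker'], dtype=int)
print(encoded.columns.tolist())  # ['Age', 'Smoker_no', 'Smoker_yes']
```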

2. Feature Selection¶

As part of feature selection, we can first use the correlation matrix to check which columns could be removed from the dataset.
Second, we could use PCA to reduce the number of features.

2.1 Correlation Matrix:¶

We can now create a correlation matrix to analyze the relationships between features and identify potential redundant variables.

In [7]:
# Correlation Matrix
corr_matrix = df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='crest', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of the Dataset')

plt.show()

The graph above shows the correlation matrix of the features in the dataset.

Some important points:

  1. Glucose has a good correlation with Outcome (0.47). This means higher glucose levels are linked to a higher chance of diabetes.

  2. Age is strongly correlated with Pregnancies (0.54), which makes sense since older women have had more pregnancies.

  3. BMI and SkinThickness are somewhat correlated (0.39), showing they might have some overlap in information.

  4. Other features like BloodPressure and Insulin have lower correlations with Outcome, meaning they might not be as important for predicting diabetes.

Despite some lower correlations, we will keep all features because each feature carries important medical information.


Why don't we use PCA ?¶

While PCA is a useful technique for reducing the number of features in a dataset, it's not necessary in this case because:

  • Few features: We only have 8 features, so there is little to reduce. PCA is most useful with many more features, like 100 or more.

  • Keeping the meaning: PCA mixes the original features into components, which makes results harder to interpret. In medical data, it is important to know exactly which feature (like glucose or blood pressure) drives the prediction.

  • Not complex enough: PCA is unlikely to improve the model much because the dataset is not very complex, and it would also make the model harder to explain.
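Even so, checking the explained variance ratio is cheap and would confirm whether a handful of components captures most of the signal. A sketch on synthetic standardized data (the real notebook would pass the `df_scaled` feature columns instead):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 8 standardized features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))

# Fit a full PCA and look at the cumulative explained variance
pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Over all components the ratios always sum to 1.0; a useful reduction
# would need the first few entries of `cumulative` to already be near 1
print(cumulative)
```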


2.2 Normalize the dataset:¶

However, we can already normalize the data to improve model performance: since the dataset contains variables on different scales, it is necessary to normalize them before training the models.

In [13]:
# Normalize the dataset (except for 'Outcome')
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df.iloc[:, :-1]), columns=df.columns[:-1])

# Keep the 'Outcome' column in the scaled dataframe
df_scaled['Outcome'] = df['Outcome']

df_scaled.describe()
Out[13]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 768.000000
mean -8.789266e-17 1.665335e-16 9.251859e-18 1.237436e-16 -7.256927e-17 3.777071e-15 1.905883e-15 2.081668e-16 0.348958
std 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 0.476951
min -1.141852e+00 -2.551447e+00 -3.999398e+00 -2.193251e+00 -8.494516e-01 -2.070032e+00 -1.189553e+00 -1.041549e+00 0.000000
25% -8.448851e-01 -7.202356e-01 -6.926692e-01 -6.903618e-01 -4.885100e-01 -7.175375e-01 -6.889685e-01 -7.862862e-01 0.000000
50% -2.509521e-01 -1.536274e-01 -3.132350e-02 -4.626640e-02 -4.885100e-01 -5.583320e-02 -3.001282e-01 -3.608474e-01 0.000000
75% 6.399473e-01 6.100618e-01 6.300222e-01 4.904798e-01 3.282361e-01 6.058711e-01 4.662269e-01 6.602056e-01 1.000000
max 3.906578e+00 2.539814e+00 4.102087e+00 7.682879e+00 6.980205e+00 5.041471e+00 5.883565e+00 4.063716e+00 1.000000

3. Model Selection¶

Now, we can start building the machine learning model. We will use the following algorithms:

  1. Decision Tree

  2. Logistic Regression

  3. KNN

  4. Multi-layer Perceptron (MLP)


Preparing the dataset for training¶

Before building our machine learning models, we need to split the dataset into features (X) and the target variable (y). Here, X contains all the input features (everything except 'Outcome') and y contains the Outcome, which indicates if a person has diabetes (1) or not (0).

In [96]:
# Prepare the dataset
X = df_scaled.drop(columns='Outcome')
y = df_scaled['Outcome']

After separating the features and target, we split the dataset into training (80%) and testing (20%) sets using train_test_split. This step is crucial because the model needs to be trained on one part of the data and tested on another to evaluate its performance.

In [97]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3.1 Decision Tree:¶

We use Decision Trees because they are easy to understand. This model works by making decisions step by step, just like answering yes/no questions. It’s good for this dataset because it can handle both numeric and categorical data, and it helps us know which features are the most important for predicting diabetes.

In [98]:
# Decision Tree
clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train)
y_pred_tree = clf_tree.predict(X_test)

3.2 Logistic Regression:¶

Logistic Regression is simple but effective for binary classification problems like predicting if someone has diabetes (yes/no). This model is easy to use and understand, and it shows us the relationship between each feature (like glucose or age) and the outcome (diabetes or not). However, it assumes a linear relationship, which can make it less powerful than more complex models.

In [99]:
# Logistic Regression
clf_lr = LogisticRegression(max_iter=1000)
clf_lr.fit(X_train, y_train)
y_pred_lr = clf_lr.predict(X_test)

3.3 KNN:¶

KNN is also a simple model that works well for classification problems. It works by finding the most similar data points in the training set (the nearest neighbors) and making a prediction based on their classes. KNN is good for this dataset because it can capture patterns that are not easily captured by other models.

In [100]:
# KNN
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_knn.fit(X_train, y_train)
y_pred_knn = clf_knn.predict(X_test)

3.4 Multi-layer Perceptron (MLP):¶

MLP is another powerful model that can learn complex patterns. It is interesting to use here because it can handle non-linear relationships between our features and the outcome.

In [101]:
# MLP
clf_mlp = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=1000)
clf_mlp.fit(X_train, y_train)
y_pred_mlp = clf_mlp.predict(X_test)

3.5 Random Forest:¶

Random Forest could well be our best model for this dataset: it is an ensemble method that combines multiple decision trees into a more accurate model. It can capture complex relationships in the data, is not sensitive to outliers, and exposes many hyperparameters we can tune to improve it further.

In [102]:
# Random Forest
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)

4. Model Evaluation¶

4.1 Decision Tree:¶

In [120]:
# Decision Tree
print("Decision Tree - Accuracy:", accuracy_score(y_test, y_pred_tree))
print("Decision Tree - Classification Report:\n\n", classification_report(y_test, y_pred_tree))
Decision Tree - Accuracy: 0.8506493506493507
Decision Tree - Classification Report:

               precision    recall  f1-score   support

         0.0       0.89      0.88      0.88        99
         1.0       0.79      0.80      0.79        55

    accuracy                           0.85       154
   macro avg       0.84      0.84      0.84       154
weighted avg       0.85      0.85      0.85       154

  • Accuracy: 85.06%

The Decision Tree model performed well with an accuracy of 85.06%. It shows a good balance between precision and recall for both classes, especially for the non-diabetic class (0.89 precision and 0.88 recall).
Overall, it outperforms most models and ranks second in terms of accuracy in this project.

Decision Tree Visualization¶

In [104]:
# Plot the decision tree
plt.figure(figsize=(20, 10))
plot_tree(clf_tree, filled=True, fontsize=10)

plt.show()

As we can see, the decision tree is messy and hard to read: it is difficult to follow the rules and decisions the model makes. This is why, in section 5, we will explore tuning the hyperparameters, especially with GridSearch.

Confusion Matrix for Decision Tree¶

In [123]:
cm_tree = confusion_matrix(y_test, y_pred_tree)
disp_tree = ConfusionMatrixDisplay(confusion_matrix=cm_tree, display_labels=[0, 1])

# Plot the confusion matrix
disp_tree.plot(cmap='Blues')
plt.title('Confusion Matrix - Decision Tree')

plt.show()

The Decision Tree model performed well in identifying both diabetic and non-diabetic patients.
The model correctly classified 87 non-diabetic patients and 44 diabetic patients. However, there are 12 false positives (non-diabetic patients classified as diabetic) and 11 false negatives (diabetic patients classified as non-diabetic). The balance between false positives and false negatives shows the model's good performance, but it can still be improved.


4.2 Logistic Regression:¶

In [105]:
# Logistic Regression
print("Logistic Regression - Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Logistic Regression - Classification Report:\n\n", classification_report(y_test, y_pred_lr))
Logistic Regression - Accuracy: 0.7792207792207793
Logistic Regression - Classification Report:

               precision    recall  f1-score   support

         0.0       0.82      0.85      0.83        99
         1.0       0.71      0.65      0.68        55

    accuracy                           0.78       154
   macro avg       0.76      0.75      0.76       154
weighted avg       0.78      0.78      0.78       154

  • Accuracy: 77.92%

The Logistic Regression model achieved an accuracy of 77.92%, which is lower than the Decision Tree. While it performs decently for the non-diabetic class, with a precision of 0.82, it struggles with the diabetic class, where precision drops to 0.71. This indicates that Logistic Regression may not capture the complexity of this dataset as well as some other models.

Confusion Matrix for Logistic Regression¶

In [125]:
cm_lr = confusion_matrix(y_test, y_pred_lr)
disp_lr = ConfusionMatrixDisplay(confusion_matrix=cm_lr, display_labels=[0, 1])

# Plot the confusion matrix
disp_lr.plot(cmap='Blues')
plt.title('Confusion Matrix - Logistic Regression')

plt.show()

The Logistic Regression model had some difficulty distinguishing between the two classes.
It misclassified 15 non-diabetic patients as diabetic (false positives) and 19 diabetic patients as non-diabetic (false negatives). Although it correctly classified 84 non-diabetic and 36 diabetic patients, the higher number of misclassifications, especially for diabetic patients, suggests that this model does not capture the full complexity of the data as well as the other models.


4.3 KNN:¶

In [106]:
# KNN
print("KNN - Accuracy:", accuracy_score(y_test, y_pred_knn))
print("KNN - Classification Report:\n\n", classification_report(y_test, y_pred_knn))
KNN - Accuracy: 0.7922077922077922
KNN - Classification Report:

               precision    recall  f1-score   support

         0.0       0.83      0.86      0.84        99
         1.0       0.73      0.67      0.70        55

    accuracy                           0.79       154
   macro avg       0.78      0.77      0.77       154
weighted avg       0.79      0.79      0.79       154

  • Accuracy: 79.22%

The KNN model performs slightly better than Logistic Regression with an accuracy of 79.22%. It has a good balance between precision and recall, but, like Logistic Regression, it still struggles with the diabetic class, where precision is 0.73. However, the model ranks fourth overall in accuracy, showing a decent but not the best performance.

Confusion Matrix for KNN¶

In [126]:
cm_knn = confusion_matrix(y_test, y_pred_knn)

disp_knn = ConfusionMatrixDisplay(confusion_matrix=cm_knn, display_labels=[0, 1])
disp_knn.plot(cmap='Blues')
plt.title('Confusion Matrix - KNN')

plt.show()

The KNN model did a bit better than Logistic Regression, with 85 non-diabetic patients correctly classified and 37 diabetic patients correctly identified.
However, it still had 14 false positives and 18 false negatives. This balance shows that while KNN captures some patterns in the data, it struggles to classify diabetic patients as well as non-diabetic ones, indicating potential limitations for this particular dataset.


4.4 Multi-layer Perceptron (MLP):¶

In [118]:
# MLP
print("MLP Classifier - Accuracy:", accuracy_score(y_test, y_pred_mlp))
print("MLP Classifier - Classification Report:\n\n", classification_report(y_test, y_pred_mlp))
MLP Classifier - Accuracy: 0.8376623376623377
MLP Classifier - Classification Report:

               precision    recall  f1-score   support

         0.0       0.87      0.88      0.87        99
         1.0       0.78      0.76      0.77        55

    accuracy                           0.84       154
   macro avg       0.82      0.82      0.82       154
weighted avg       0.84      0.84      0.84       154

  • Accuracy: 83.77%

The MLP model performs well with an accuracy of 83.77%, which places it third after the Decision Tree. The MLP shows strong performance for both classes, with good precision and recall scores. The slightly lower accuracy compared to the Decision Tree may be due to the complexity of tuning neural networks, but overall, it is one of the stronger models in this comparison.

Confusion Matrix for MLP¶

In [127]:
cm_mlp = confusion_matrix(y_test, y_pred_mlp)

disp_mlp = ConfusionMatrixDisplay(confusion_matrix=cm_mlp, display_labels=[0, 1])
disp_mlp.plot(cmap='Blues')
plt.title('Confusion Matrix - MLP')

plt.show()

The MLP model performed similarly to the Decision Tree, correctly identifying 87 non-diabetic and 42 diabetic patients.
However, it still misclassified 12 non-diabetic patients (false positives) and 13 diabetic patients (false negatives). The MLP model does well but could reduce these misclassifications, especially in identifying diabetic patients.


4.5 Random Forest:¶

In [108]:
# Random Forest
print("Random Forest - Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest - Classification Report:\n", classification_report(y_test, y_pred_rf))
Random Forest - Accuracy: 0.8831168831168831
Random Forest - Classification Report:
               precision    recall  f1-score   support

         0.0       0.91      0.91      0.91        99
         1.0       0.84      0.84      0.84        55

    accuracy                           0.88       154
   macro avg       0.87      0.87      0.87       154
weighted avg       0.88      0.88      0.88       154

  • Accuracy: 88.31%

The Random Forest model achieved an accuracy of 88.31%, the highest among all the models tested so far. It performs especially well for the non-diabetic class (0), with a recall of 0.91, meaning it correctly identifies the majority of non-diabetic patients. Overall, Random Forest ranks first in terms of accuracy and general performance for this project.

Confusion Matrix for Random Forest¶

In [128]:
cm_rf = confusion_matrix(y_test, y_pred_rf)

disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=[0, 1])
disp_rf.plot(cmap='Blues')
plt.title('Confusion Matrix - Random Forest')

plt.show()

The Random Forest model showed the best performance, with 90 non-diabetic patients and 46 diabetic patients correctly classified.
It had the fewest misclassifications, with 9 false positives and 9 false negatives. This demonstrates that Random Forest was the most effective model for this dataset, showing a good balance between precision and recall for both classes and suggesting that it captures complex patterns in the data better than the other models.


Feature Importance with Random Forest¶

Since the Random Forest model performed the best, we can analyze the feature importance to understand which variables have the most impact on predicting diabetes. This information can be valuable for medical professionals to identify key risk factors for diabetes.

In [109]:
# Feature Importance analysis
rf_feature_importance = clf_rf.feature_importances_
features = X_train.columns

sorted_idx_rf = np.argsort(rf_feature_importance)
sorted_features_rf = pd.DataFrame({'Feature': features[sorted_idx_rf], 'Importance': rf_feature_importance[sorted_idx_rf]})

plt.figure(figsize=(10, 6))
plt.barh(sorted_features_rf['Feature'], sorted_features_rf['Importance'], align='center')
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Random Forest - Feature Importance (sorted)")

plt.show()

The bar chart above shows the feature importance of each variable according to the Random Forest model.

Some observations:

  • Insulin is by far the most important feature, with an importance score of more than 0.35. This shows that insulin levels are a critical factor in predicting diabetes.

  • Glucose is the second most important feature, with a score between 0.15 and 0.20. This confirms the strong correlation we saw earlier between glucose levels and diabetes.

  • SkinThickness is also an important feature, with a score around 0.15.

  • Features like Age, BMI and DiabetesPedigreeFunction have lower importance scores, but they still contribute to the model's predictive power.

  • Intriguingly, Pregnancies and BloodPressure have the lowest importance scores.

5. Hyperparameter Tuning¶

In this section, we tune the hyperparameters of our Decision Tree and Random Forest models to see whether we can improve their performance further.
One option is to set them manually, as in this example:

In [143]:
# Hyperparameter tuning for Random Forest
rf = RandomForestClassifier(random_state=42, n_estimators=100, max_features=3)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

print("Random Forest (with hyperparameters) - Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest (with hyperparameters) - Classification Report:\n\n", classification_report(y_test, y_pred_rf))

# Plot the confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=[0, 1])
disp_rf.plot(cmap='Blues')
plt.title('Confusion Matrix - Random Forest (adjusted hyperparameters)')

plt.show()
Random Forest (with hyperparameters) - Accuracy: 0.8896103896103896
Random Forest (with hyperparameters) - Classification Report:

               precision    recall  f1-score   support

         0.0       0.89      0.94      0.92        99
         1.0       0.88      0.80      0.84        55

    accuracy                           0.89       154
   macro avg       0.89      0.87      0.88       154
weighted avg       0.89      0.89      0.89       154


This gives a very good result, but only if we are lucky enough to pick the right hyperparameters by hand.
A better approach is GridSearch, which tests different combinations of hyperparameters and selects the best one for us.

5.1 GridSearch for Decision Tree:¶

In [129]:
# Hyperparameter Tuning for Decision Tree
param_grid_tree = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

grid_search_tree = GridSearchCV(DecisionTreeClassifier(), param_grid_tree, cv=5)
grid_search_tree.fit(X_train, y_train)

# Best hyperparameters for Decision Tree
print("Best Decision Tree Parameters:", grid_search_tree.best_params_)

y_pred_tree_opt = grid_search_tree.predict(X_test)
print("Optimized Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree_opt))
print("Optimized Decision Tree Classification Report:\n\n", classification_report(y_test, y_pred_tree_opt))
Best Decision Tree Parameters: {'max_depth': 3, 'min_samples_split': 2}
Optimized Decision Tree Accuracy: 0.8506493506493507
Optimized Decision Tree Classification Report:
               precision    recall  f1-score   support

         0.0       0.89      0.88      0.88        99
         1.0       0.79      0.80      0.79        55

    accuracy                           0.85       154
   macro avg       0.84      0.84      0.84       154
weighted avg       0.85      0.85      0.85       154

Something interesting to note is that we have exactly the same accuracy as before, but let's plot this Tree again:

In [144]:
# Re-plot the Decision Tree after tuning hyperparameters
plt.figure(figsize=(20, 10))
plot_tree(grid_search_tree.best_estimator_, filled=True, fontsize=10)
plt.title('Optimized Decision Tree')

plt.show()

Finally, this Decision Tree is less complex and easier to understand than the first one. Its performance is unchanged while its complexity has been reduced, making it more efficient and interpretable (plotting the tree is also faster).

5.2 GridSearch for Random Forest:¶

In [145]:
# Hyperparameter Tuning with GridSearch

# Define the hyperparameters to search
param_grid = {
    'n_estimators': [50, 100],  # Number of trees in the forest
    'max_depth': [10, 20],  # Maximum depth of the tree
    'min_samples_split': [2, 5],  # Minimum number of the samples required to split on the node
    'min_samples_leaf': [1, 2],  # Minimum number of the samples required to be at a leaf node
    'max_features': ['sqrt', 'log2'],  # Which features to consider at each split
    'bootstrap': [True]
}

rf_model = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')  # n_jobs=-1 to use all processors

grid_search_rf.fit(X_train, y_train)

print(f"The Best Hyperparameters: {grid_search_rf.best_params_}")

# Evaluate the optimized model
best_rf = grid_search_rf.best_estimator_
y_pred_rf_best = best_rf.predict(X_test)

print(f"Optimized Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf_best)}")
print("Classification Report:\n\n", classification_report(y_test, y_pred_rf_best))

# Plot the confusion matrix
cm_rf_best = confusion_matrix(y_test, y_pred_rf_best)
disp_rf_best = ConfusionMatrixDisplay(confusion_matrix=cm_rf_best, display_labels=[0, 1])
disp_rf_best.plot(cmap='Blues')
plt.title('Confusion Matrix - Optimized Random Forest')

plt.show()
Fitting 5 folds for each of 32 candidates, totalling 160 fits
The Best Hyperparameters: {'bootstrap': True, 'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Optimized Random Forest Accuracy: 0.8896103896103896
Classification Report:

               precision    recall  f1-score   support

         0.0       0.90      0.93      0.92        99
         1.0       0.87      0.82      0.84        55

    accuracy                           0.89       154
   macro avg       0.88      0.87      0.88       154
weighted avg       0.89      0.89      0.89       154


After tuning, the accuracy increased slightly from 88.31% to 88.96%.
The number of false positives decreased but the number of false negatives increased slightly.


Let's finish by explaining 2 points:¶

Why Did Random Forest Perform Best?

Random Forest performed the best because it is an ensemble model of multiple Decision Trees. This means it uses many of them together, instead of just one. Each tree gives a different prediction, and the model chooses the majority vote. This helps Random Forest make better decisions because it learns from different parts of the data. It also handles missing data and outliers (extreme values) better than other models, even if we replaced most of them with the median.
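The majority-vote idea above can be sketched by hand: train a few trees on bootstrap samples and let them vote. This is a simplified illustration of what RandomForestClassifier does internally, shown on synthetic data rather than the project's `X_train`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the diabetes features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

rng = np.random.default_rng(42)
predictions = []
for _ in range(5):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    predictions.append(tree.predict(X))

# Majority vote across the five trees (mean vote > 0.5 => class 1)
votes = (np.mean(predictions, axis=0) > 0.5).astype(int)
print(votes.shape)  # one voted prediction per sample
```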

Why Did Logistic Regression Perform Worse?

Logistic Regression did not perform as well because it assumes a simple relationship between the features and the outcome (diabetes or no diabetes). This relationship is linear, which means it thinks that the features add up in a straight line to predict the result. But, in real life, the data can be more complex and non-linear. This is why models like Random Forest and Decision Trees, which can handle more complex patterns, do better on this dataset.
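This linear assumption can be made concrete: Logistic Regression scores a patient as sigmoid(w·x + b), a straight-line combination of the features passed through a squashing function. A tiny sketch with made-up weights (not the fitted coefficients from `clf_lr`):

```python
import numpy as np

def sigmoid(z):
    # Maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights for two standardized features (e.g. Glucose, BMI)
w = np.array([1.2, 0.4])
b = -0.5

x = np.array([0.8, 0.3])   # one (standardized) patient
p = sigmoid(w @ x + b)     # probability of Outcome = 1
print(p > 0.5)             # True: classified diabetic since p exceeds 0.5
```

Because the score w·x + b is linear in the features, the decision boundary is a straight line (a hyperplane), which is exactly what limits Logistic Regression on non-linear data.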


Conclusion¶

In conclusion, we tested multiple models: Decision Trees, Logistic Regression, KNN, MLP, and Random Forest. Random Forest provided the best performance, with an accuracy of 88.96% after hyperparameter tuning. We also demonstrated how GridSearch can optimize models, improving accuracy slightly while reducing errors and model complexity.
Ultimately, Random Forest was the most effective model, showing strong precision and recall for both diabetic and non-diabetic classes, making it the best choice for this dataset.